CS4132 Data Analytics
Summary of Research Questions & Results
1. Which teams have won the most games and which teams scored the most per game?
2. Which strategies work best for scoring and winning games, and what are the most used strategies?
1. Which teams have won the most games and which teams scored the most per game?
2. Which strategies work best for scoring and winning games, and what are the most used strategies?
American football is the most popular sport in the United States, with many football events taking place every year, such as Super Bowl and NFL. Almost 100 million people tuned in to watch the 2019 Super Bowl. Its popularity is steadily growing both in the US and overseas. I aim to study the various statistics related to football leagues, such as the winrates and watch rates of different football matches, as well as the patterns in the players themselves and how they change over time, and whether these trends differ across various contexts, such as leagues and post season games.
http://nflsavant.com/pbp_data.php?year=2021, http://nflsavant.com/pbp_data.php?year=2020, http://nflsavant.com/pbp_data.php?year=2019, http://nflsavant.com/pbp_data.php?year=2018, http://nflsavant.com/pbp_data.php?year=2017, http://nflsavant.com/pbp_data.php?year=2016, http://nflsavant.com/pbp_data.php?year=2015, http://nflsavant.com/pbp_data.php?year=2014, http://nflsavant.com/pbp_data.php?year=2013, (NFL Play by Play data for years 2013 to 2021)
http://nflsavant.com/dump/combine.csv?year=2015, http://nflsavant.com/dump/players_2013-12-12.csv, http://nflsavant.com/dump/weather_20131231.csv Player, weather and miscellaneous data
https://github.com/ryurko/nflscrapR-data/tree/master/games_data https://github.com/ryurko/nflscrapR-data/tree/master/legacy_data https://github.com/ryurko/nflscrapR-data/tree/master/play_by_play_data https://github.com/ryurko/nflscrapR-data/tree/master/roster_data (Game and Roster data for years 2009 to 2019)
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_National_Football_League/National_Football_League_team_abbreviations (List of NFL Team Abbreviations)
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import glob
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import squarify
from pywaffle import Waffle
import joypy
import plotly.express as px
# Dataset 1: Combined data of all games played from 2009 to 2019 seasons
all_games = glob.glob("games_data/*/*.csv")
df_ag = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in all_games))
df_ag = df_ag.reset_index(drop=True)
df_ag.head(5)
| type | game_id | home_team | away_team | week | season | state_of_game | game_url | home_score | away_score | Unnamed: 0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | post | 2010010901 | CIN | NYJ | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 14.0 | 24.0 | NaN |
| 1 | post | 2010010900 | DAL | PHI | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 34.0 | 14.0 | NaN |
| 2 | post | 2010011001 | NE | BAL | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 14.0 | 33.0 | NaN |
| 3 | post | 2010011000 | ARI | GB | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 51.0 | 45.0 | NaN |
| 4 | post | 2010011601 | NO | ARI | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 45.0 | 14.0 | NaN |
# Dataset 2: Combined play by play data of all games played from 2013 to 2021
pbp = glob.glob("play_by_play_data/all/*.csv")
df_pbp = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in pbp))
df_pbp.head(5)
| GameId | GameDate | Quarter | Minute | Second | OffenseTeam | DefenseTeam | Down | ToGo | YardLine | ... | IsTwoPointConversion | IsTwoPointConversionSuccessful | RushDirection | YardLineFixed | YardLineDirection | IsPenaltyAccepted | PenaltyTeam | IsNoPlay | PenaltyType | PenaltyYards | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013091512 | 2013-09-15 | 4 | 5 | 9 | JAX | LV | 4 | 13 | 16 | ... | 0.0 | 0 | NaN | 16 | OWN | 1.0 | OAK | 0.0 | OFFENSIVE HOLDING | 10.0 |
| 1 | 2013091512 | 2013-09-15 | 4 | 0 | 36 | JAX | LV | 2 | 10 | 65 | ... | 0.0 | 0 | NaN | 35 | OPP | 0.0 | NaN | 0.0 | NaN | 0.0 |
| 2 | 2013091512 | 2013-09-15 | 4 | 5 | 21 | JAX | LV | 3 | 6 | 23 | ... | 0.0 | 0 | NaN | 23 | OWN | 0.0 | NaN | 0.0 | NaN | 0.0 |
| 3 | 2013091512 | 2013-09-15 | 4 | 5 | 24 | JAX | LV | 2 | 6 | 23 | ... | 0.0 | 0 | NaN | 23 | OWN | 0.0 | NaN | 0.0 | NaN | 0.0 |
| 4 | 2013091512 | 2013-09-15 | 4 | 5 | 55 | JAX | LV | 1 | 10 | 19 | ... | 0.0 | 0 | NaN | 19 | OWN | 0.0 | NaN | 0.0 | NaN | 0.0 |
5 rows × 45 columns
# Dataset 3: All post-season games data from 2009 to 2019
pbp_post = glob.glob("play_by_play_data/post_season/*.csv")
df_pbpp = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in pbp_post))
df_pbpp.head(5)
| play_id | game_id | home_team | away_team | posteam | posteam_type | defteam | side_of_field | yardline_100 | game_date | ... | penalty_player_name | penalty_yards | replay_or_challenge | replay_or_challenge_result | penalty_type | defensive_two_point_attempt | defensive_two_point_conv | defensive_extra_point_attempt | defensive_extra_point_conv | Unnamed: 0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 2010010901 | CIN | NYJ | CIN | home | NYJ | NYJ | 30.0 | 2010-01-09 | ... | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 1 | 61 | 2010010901 | CIN | NYJ | CIN | home | NYJ | NYJ | 36.0 | 2010-01-09 | ... | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 2 | 82 | 2010010901 | CIN | NYJ | CIN | home | NYJ | NYJ | 30.0 | 2010-01-09 | ... | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 3 | 104 | 2010010901 | CIN | NYJ | CIN | home | NYJ | NYJ | 30.0 | 2010-01-09 | ... | NaN | NaN | 1 | upheld | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 4 | 166 | 2010010901 | CIN | NYJ | NYJ | away | CIN | NYJ | 74.0 | 2010-01-09 | ... | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
5 rows × 257 columns
# Dataset 4: All pre-season games data from 2009 to 2019
pbp_pre = glob.glob("play_by_play_data/pre_season/*.csv")
df_pbppr = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in pbp_pre))
df_pbppr.head(5)
| play_id | game_id | home_team | away_team | posteam | posteam_type | defteam | side_of_field | yardline_100 | game_date | ... | penalty_player_id | penalty_player_name | penalty_yards | replay_or_challenge | replay_or_challenge_result | penalty_type | defensive_two_point_attempt | defensive_two_point_conv | defensive_extra_point_attempt | defensive_extra_point_conv | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37 | 2009080950 | TEN | BUF | TEN | home | BUF | BUF | 30.0 | 2009-08-09 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 59 | 2009080950 | TEN | BUF | TEN | home | BUF | TEN | 74.0 | 2009-08-09 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 83 | 2009080950 | TEN | BUF | TEN | home | BUF | TEN | 64.0 | 2009-08-09 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 104 | 2009080950 | TEN | BUF | TEN | home | BUF | TEN | 67.0 | 2009-08-09 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 128 | 2009080950 | TEN | BUF | TEN | home | BUF | TEN | 55.0 | 2009-08-09 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 256 columns
# Dataset 5: All regular season games data from 2009 to 2019
pbp_reg = glob.glob("play_by_play_data/regular_season/*.csv")
df_pbpr = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in pbp_reg))
df_pbpr.head(5)
| play_id | game_id | home_team | away_team | posteam | posteam_type | defteam | side_of_field | yardline_100 | game_date | ... | penalty_player_id | penalty_player_name | penalty_yards | replay_or_challenge | replay_or_challenge_result | penalty_type | defensive_two_point_attempt | defensive_two_point_conv | defensive_extra_point_attempt | defensive_extra_point_conv | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46 | 2009091000 | PIT | TEN | PIT | home | TEN | TEN | 30.0 | 2009-09-10 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 68 | 2009091000 | PIT | TEN | PIT | home | TEN | PIT | 58.0 | 2009-09-10 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 92 | 2009091000 | PIT | TEN | PIT | home | TEN | PIT | 53.0 | 2009-09-10 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 113 | 2009091000 | PIT | TEN | PIT | home | TEN | PIT | 56.0 | 2009-09-10 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 139 | 2009091000 | PIT | TEN | PIT | home | TEN | PIT | 56.0 | 2009-09-10 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 256 columns
# Dataset 6: Player post-season roster data from 2009 to 2018
rost_post = glob.glob("roster_data/post_season/*.csv")
df_rp = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in rost_post))
df_rp.head(5)
| season | season_type | full_player_name | abbr_player_name | team | position | gsis_id | |
|---|---|---|---|---|---|---|---|
| 0 | 2009 | post | Aaron Rodgers | A.Rodgers | GB | QB | 00-0023459 |
| 1 | 2009 | post | Adrian Peterson | A.Peterson | MIN | RB | 00-0025394 |
| 2 | 2009 | post | Ahman Green | A.Green | GB | RB | 00-0006305 |
| 3 | 2009 | post | Alex Smith | A.Smith | PHI | TE | 00-0023506 |
| 4 | 2009 | post | Andre Caldwell | A.Caldwell | CIN | WR | 00-0026237 |
# Dataset 7: Player pre-season roster data from 2009 to 2019
rost_pre = glob.glob("roster_data/pre_season/*.csv")
df_rpr = pd.concat((pd.read_csv(f, on_bad_lines='skip') for f in rost_pre))
df_rpr.head(5)
| season | season_type | full_player_name | abbr_player_name | team | position | gsis_id | |
|---|---|---|---|---|---|---|---|
| 0 | 2009 | pre | Aaron Kelly | A.Kelly | ATL | WR | 00-0026504 |
| 1 | 2009 | pre | Aaron Walker | A.Walker | CLE | TE | 00-0022060 |
| 2 | 2009 | pre | Adam DiMichele | A.DiMichele | PHI | QB | 00-0026895 |
| 3 | 2009 | pre | Alex Brink | A.Brink | HOU | QB | 00-0026363 |
| 4 | 2009 | pre | Alex Mortensen | A.Mortensen | TEN | QB | \tGSIS ID: |
# Dataset 8: Player regular season roster data from 2009 to 2019
rost_reg = glob.glob("roster_data/regular_season/*.csv")
df_rr = pd.concat((pd.read_csv(f) for f in rost_reg))
df_rr.head(5)
| season | season_type | full_player_name | abbr_player_name | team | position | gsis_id | |
|---|---|---|---|---|---|---|---|
| 0 | 2009 | reg | Aaron Brown | A.Brown | DET | RB | 00-0027129 |
| 1 | 2009 | reg | Aaron Rodgers | A.Rodgers | GB | QB | 00-0023459 |
| 2 | 2009 | reg | Aaron Stecker | A.Stecker | ATL | RB | 00-0015601 |
| 3 | 2009 | reg | Adrian Peterson | A.Peterson | CHI | RB | 00-0021306 |
| 4 | 2009 | reg | Adrian Peterson | A.Peterson | MIN | RB | 00-0025394 |
# Dataset 9: Pass, rush and receive data from 2019 to 2022
prr = pd.read_csv("misc/nfl_pass_rush_receive_raw_data.csv")
prr.head(5)
| game_id | player_id | pos | player | team | pass_cmp | pass_att | pass_yds | pass_td | pass_int | ... | OT | Roof | Surface | Temperature | Humidity | Wind_Speed | Vegas_Line | Vegas_Favorite | Over_Under | game_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201909050chi | RodgAa00 | QB | Aaron Rodgers | GNB | 18 | 30 | 203 | 1 | 0 | ... | False | outdoors | grass | 65 | 69 | 10 | -3.5 | CHI | 47.0 | 5/9/2019 |
| 1 | 201909050chi | JoneAa00 | RB | Aaron Jones | GNB | 0 | 0 | 0 | 0 | 0 | ... | False | outdoors | grass | 65 | 69 | 10 | -3.5 | CHI | 47.0 | 5/9/2019 |
| 2 | 201909050chi | ValdMa00 | WR | Marquez Valdes-Scantling | GNB | 0 | 0 | 0 | 0 | 0 | ... | False | outdoors | grass | 65 | 69 | 10 | -3.5 | CHI | 47.0 | 5/9/2019 |
| 3 | 201909050chi | AdamDa01 | WR | Davante Adams | GNB | 0 | 0 | 0 | 0 | 0 | ... | False | outdoors | grass | 65 | 69 | 10 | -3.5 | CHI | 47.0 | 5/9/2019 |
| 4 | 201909050chi | GrahJi00 | TE | Jimmy Graham | GNB | 0 | 0 | 0 | 0 | 0 | ... | False | outdoors | grass | 65 | 69 | 10 | -3.5 | CHI | 47.0 | 5/9/2019 |
5 rows × 69 columns
# Dataset 10: Details of players who played from 1920 to 2013
players = pd.read_csv("misc/players_2013-12-12.csv")
players.head(5)
| name | first_name | last_name | birth_city | birth_state | birth_country | birth_date | college | draft_team | draft_round | ... | draft_year | position | height | weight | death_date | death_city | death_state | death_country | year_start | year_end | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Isaako Aaitui | Isaako | Aaitui | NaN | NaN | NaN | 1/25/1987 | NaN | NaN | NaN | ... | NaN | NT | 4-Jun | 315.0 | NaN | NaN | NaN | NaN | 2013 | 2013 |
| 1 | Faye Abbott | Faye | Abbott | Clearport | OH | USA | 1895-08-16 | Syracuse | NaN | NaN | ... | NaN | BB-FB-TB-QB-WB- | 8-May | 182.0 | 1/22/1965 | Dayton | OH | NaN | 1921 | 1929 |
| 2 | Vince Abbott | Vince | Abbott | London | NaN | England | 5/31/1958 | Washington | NaN | NaN | ... | NaN | K | 11-May | 207.0 | NaN | NaN | NaN | NaN | 1921 | 1929 |
| 3 | Duke Abbruzzi | Duke | Abbruzzi | Warren | RI | USA | 8/3/1917 | Rhode Island | NaN | NaN | ... | NaN | HB-DB | 10-May | 175.0 | 12/6/1982 | Newport | RI | NaN | 1921 | 1929 |
| 4 | Karim Abdul-Jabbar | Karim | Abdul-Jabbar | Los Angeles | CA | USA | 6/28/1974 | UCLA | Miami Dolphins | 3rd | ... | 1996.0 | RB | 10-May | 194.0 | NaN | NaN | NaN | NaN | 1996 | 2000 |
5 rows × 21 columns
# Dataset 11: Weather data from 1960 to 2013
weather = pd.read_csv("misc/weather_20131231.csv")
weather.head(5)
| id | home_team | home_score | away_team | away_score | temperature | wind_chill | humidity | wind_mph | weather | date | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 196009230ram | Los Angeles Rams | 21 | St. Louis Cardinals | 43 | 66 | NaN | 78% | 8.0 | 66 degrees- relative humidity 78%- wind 8 mph | 9/23/1960 |
| 1 | 196009240dal | Dallas Cowboys | 28 | Pittsburgh Steelers | 35 | 72 | NaN | 80% | 16.0 | 72 degrees- relative humidity 80%- wind 16 mph | 9/24/1960 |
| 2 | 196009250gnb | Green Bay Packers | 14 | Chicago Bears | 17 | 60 | NaN | 76% | 13.0 | 60 degrees- relative humidity 76%- wind 13 mph | 9/25/1960 |
| 3 | 196009250sfo | San Francisco 49ers | 19 | New York Giants | 21 | 72 | NaN | 44% | 10.0 | 72 degrees- relative humidity 44%- wind 10 mph | 9/25/1960 |
| 4 | 196009250clt | Baltimore Colts | 20 | Washington Redskins | 0 | 62 | NaN | 80% | 9.0 | 62 degrees- relative humidity 80%- wind 9 mph | 9/25/1960 |
The relevant columns are renamed in each dataset:
columns = ["GameId","GameDate","OffenseTeam","DefenseTeam","Down","ToGo","YardLine", "Description", "PlayType", "IsIncomplete", "IsSack", "IsTouchdown", "IsFumble", "PenaltyTeam", "PenaltyType", "PenaltyYards"]
df_pbpp = df_pbpp.drop("Unnamed: 0", axis=1)
pbp_combined_raw = pd.concat([df_pbpp, df_pbppr, df_pbpr], ignore_index=True)
pbp_combined_raw.extra_point_attempt = pbp_combined_raw.extra_point_attempt.shift(-1)
pbp_combined_raw.two_point_attempt = pbp_combined_raw.two_point_attempt.shift(-1)
pbp_combined_raw.head(5)
| play_id | game_id | home_team | away_team | posteam | posteam_type | defteam | side_of_field | yardline_100 | game_date | ... | penalty_player_id | penalty_player_name | penalty_yards | replay_or_challenge | replay_or_challenge_result | penalty_type | defensive_two_point_attempt | defensive_two_point_conv | defensive_extra_point_attempt | defensive_extra_point_conv | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 2010010901 | CIN | NYJ | CIN | home | NYJ | NYJ | 30.0 | 2010-01-09 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 61 | 2010010901 | CIN | NYJ | CIN | home | NYJ | NYJ | 36.0 | 2010-01-09 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 82 | 2010010901 | CIN | NYJ | CIN | home | NYJ | NYJ | 30.0 | 2010-01-09 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 104 | 2010010901 | CIN | NYJ | CIN | home | NYJ | NYJ | 30.0 | 2010-01-09 | ... | NaN | NaN | NaN | 1 | upheld | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 166 | 2010010901 | CIN | NYJ | NYJ | away | CIN | NYJ | 74.0 | 2010-01-09 | ... | NaN | NaN | NaN | 0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 256 columns
As the team name abbreviations are inconsistent across the datasets and the full names are not provided in some of said inconsistent datasets, each team's abbreviation is manually matched with the team name. Note that the current team name is used in case of team name changes in the data timeframe.
teams = {
"ARZ": "Arizona Cardinals",
"ARI": "Arizona Cardinals",
"CRD": "Arizona Cardinals",
"ATL": "Atlanta Falcons",
"BLT": "Baltimore Ravens",
"BAL": "Baltimore Ravens",
"RAV": "Baltimore Ravens",
"BUF": "Buffalo Bills",
"CAR": "Carolina Panthers",
"CHI": "Chicago Bears",
"CIN": "Cincinnati Bengals",
"CLV": "Cleveland Browns",
"CLE": "Cleveland Browns",
"DAL": "Dallas Cowboys",
"DEN": "Denver Broncos",
"DET": "Detroit Lions",
"GB": "Green Bay Packers",
"GNB": "Green Bay Packers",
"HST": "Houston Texans",
"HOU": "Houston Texans",
"HTX": "Houston Texans",
"IND": "Indianapolis Colts",
"CLT": "Indianapolis Colts",
"JAC": "Jacksonville Jaguars",
"JAX": "Jacksonville Jaguars",
"KC": "Kansas City Chiefs",
"KAN": "Kansas City Chiefs",
"LV": "Las Vegas Raiders",
"LVR": "Las Vegas Raiders",
"RAI": "Las Vegas Raiders",
"OAK": "Las Vegas Raiders",
"SD": "Los Angeles Chargers",
"SDG": "Los Angeles Chargers",
"LAC": "Los Angeles Chargers",
"STL": "Los Angeles Rams",
"LAR": "Los Angeles Rams",
"RAM": "Los Angeles Rams",
"LA": "Los Angeles Rams",
"MIA": "Miami Dolphins",
"MIN": "Minnesota Vikings",
"NE": "New England Patriots",
"NWE": "New England Patriots",
"NO": "New Orleans Saints",
"NOR": "New Orleans Saints",
"NYG": "New York Giants",
"NYJ": "New York Jets",
"PHI": "Philadelphia Eagles",
"PIT": "Pittsburgh Steelers",
"SF": "San Francisco 49ers",
"SFO": "San Francisco 49ers",
"SEA": "Seattle Seahawks",
"TAM": "Tampa Bay Buccaneers",
"TB": "Tampa Bay Buccaneers",
"TEN": "Tennessee Titans",
"OTI": "Tennessee Titans",
"WAS": "Washington Commanders",
"APR": "American Football Conference",
"NPR": "National Football Conference",
"RIC": "Team Rice",
"SAN": "Team Sanders",
"IRV": "Team Irvin",
"CRT": "Team Carter"
}
Redundant columns are dropped from the players dataset:
players = players[["name","birth_country", "draft_year", "weight", "position", "year_start", "year_end"]]
players.head(5)
| name | birth_country | draft_year | weight | position | year_start | year_end | |
|---|---|---|---|---|---|---|---|
| 0 | Isaako Aaitui | NaN | NaN | 315.0 | NT | 2013 | 2013 |
| 1 | Faye Abbott | USA | NaN | 182.0 | BB-FB-TB-QB-WB- | 1921 | 1929 |
| 2 | Vince Abbott | England | NaN | 207.0 | K | 1921 | 1929 |
| 3 | Duke Abbruzzi | USA | NaN | 175.0 | HB-DB | 1921 | 1929 |
| 4 | Karim Abdul-Jabbar | USA | 1996.0 | 194.0 | RB | 1996 | 2000 |
The humidity data is converted to integer to make graphing easier.
def clean_humidity(string):
return int(string.replace('%',""))
weather.humidity[weather.humidity.notna()] = weather.humidity[weather.humidity.notna()].apply(clean_humidity)
id_str = weather.id.str[-3:].str.upper()
weather.insert(1, "team_abbr", value=id_str)
weather.id = weather.id.str[:-3]
weather.id = pd.to_numeric(weather.id)
weather.head(5)
| id | team_abbr | home_team | home_score | away_team | away_score | temperature | wind_chill | humidity | wind_mph | weather | date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 196009230 | RAM | Los Angeles Rams | 21 | St. Louis Cardinals | 43 | 66 | NaN | 78.0 | 8.0 | 66 degrees- relative humidity 78%- wind 8 mph | 9/23/1960 |
| 1 | 196009240 | DAL | Dallas Cowboys | 28 | Pittsburgh Steelers | 35 | 72 | NaN | 80.0 | 16.0 | 72 degrees- relative humidity 80%- wind 16 mph | 9/24/1960 |
| 2 | 196009250 | GNB | Green Bay Packers | 14 | Chicago Bears | 17 | 60 | NaN | 76.0 | 13.0 | 60 degrees- relative humidity 76%- wind 13 mph | 9/25/1960 |
| 3 | 196009250 | SFO | San Francisco 49ers | 19 | New York Giants | 21 | 72 | NaN | 44.0 | 10.0 | 72 degrees- relative humidity 44%- wind 10 mph | 9/25/1960 |
| 4 | 196009250 | CLT | Baltimore Colts | 20 | Washington Redskins | 0 | 62 | NaN | 80.0 | 9.0 | 62 degrees- relative humidity 80%- wind 9 mph | 9/25/1960 |
The rosters for each season type are combined and grouped by season and season type. Data with corrupted GSIS ID values are removed.
rosters = pd.concat([df_rp, df_rpr, df_rr], ignore_index=True)
rosters.sort_values(by=['season', 'season_type'])
rosters = rosters[rosters.gsis_id != "\tGSIS ID: "]
rosters.head(5)
| season | season_type | full_player_name | abbr_player_name | team | position | gsis_id | |
|---|---|---|---|---|---|---|---|
| 0 | 2009 | post | Aaron Rodgers | A.Rodgers | GB | QB | 00-0023459 |
| 1 | 2009 | post | Adrian Peterson | A.Peterson | MIN | RB | 00-0025394 |
| 2 | 2009 | post | Ahman Green | A.Green | GB | RB | 00-0006305 |
| 3 | 2009 | post | Alex Smith | A.Smith | PHI | TE | 00-0023506 |
| 4 | 2009 | post | Andre Caldwell | A.Caldwell | CIN | WR | 00-0026237 |
Lastly, the position abbreviations in the roster dataset is manually replaced as the full names are not given anywhere in the dataset. This will allow us to more clearly understand which position a player is playing as later on.
positions = {
"QB": "Quarterback",
"RB": "Running Back",
"WR": "Wide Receiver",
"TE": "Tight End",
"FB": "Full Back"
}
rosters["position"] = rosters["position"].map(positions)
rosters.head(5)
| season | season_type | full_player_name | abbr_player_name | team | position | gsis_id | |
|---|---|---|---|---|---|---|---|
| 0 | 2009 | post | Aaron Rodgers | A.Rodgers | GB | Quarterback | 00-0023459 |
| 1 | 2009 | post | Adrian Peterson | A.Peterson | MIN | Running Back | 00-0025394 |
| 2 | 2009 | post | Ahman Green | A.Green | GB | Running Back | 00-0006305 |
| 3 | 2009 | post | Alex Smith | A.Smith | PHI | Tight End | 00-0023506 |
| 4 | 2009 | post | Andre Caldwell | A.Caldwell | CIN | Wide Receiver | 00-0026237 |
To find the teams' performance, we total the scores for each team.
team_scores = dict()
def add_scores_teams(name: str, score: float):
try:
team_scores[teams[name]] += score
except KeyError:
team_scores[teams[name]] = score
def update_team_scores_1(row):
add_scores_teams(row.home_team, row.home_score)
add_scores_teams(row.away_team, row.away_score)
return row
df_ag.apply(update_team_scores_1, axis='columns')
| type | game_id | home_team | away_team | week | season | state_of_game | game_url | home_score | away_score | Unnamed: 0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | post | 2010010901 | CIN | NYJ | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 14.0 | 24.0 | NaN |
| 1 | post | 2010010900 | DAL | PHI | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 34.0 | 14.0 | NaN |
| 2 | post | 2010011001 | NE | BAL | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 14.0 | 33.0 | NaN |
| 3 | post | 2010011000 | ARI | GB | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 51.0 | 45.0 | NaN |
| 4 | post | 2010011601 | NO | ARI | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 45.0 | 14.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3656 | reg | 2019122900 | BAL | PIT | 17 | 2019 | POST | http://www.nfl.com/liveupdate/game-center/2019... | 28.0 | 10.0 | NaN |
| 3657 | reg | 2019122907 | JAX | IND | 17 | 2019 | POST | http://www.nfl.com/liveupdate/game-center/2019... | 38.0 | 20.0 | NaN |
| 3658 | reg | 2019122913 | DEN | OAK | 17 | 2019 | POST | http://www.nfl.com/liveupdate/game-center/2019... | 16.0 | 15.0 | NaN |
| 3659 | reg | 2019122914 | LA | ARI | 17 | 2019 | POST | http://www.nfl.com/liveupdate/game-center/2019... | 31.0 | 24.0 | NaN |
| 3660 | reg | 2019122915 | SEA | SF | 17 | 2019 | POST | http://www.nfl.com/liveupdate/game-center/2019... | 21.0 | 26.0 | NaN |
3661 rows × 11 columns
We also total the number of games each team participated in.
team_games = dict()
def add_games_teams(name: str):
try:
team_games[teams[name]] += 1
except KeyError:
team_games[teams[name]] = 1
def update_team_games_1(row):
add_games_teams(row.home_team)
add_games_teams(row.away_team)
return row
df_ag = df_ag.apply(update_team_games_1, axis='columns')
df_ag.head(5)
| type | game_id | home_team | away_team | week | season | state_of_game | game_url | home_score | away_score | Unnamed: 0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | post | 2010010901 | CIN | NYJ | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 14.0 | 24.0 | NaN |
| 1 | post | 2010010900 | DAL | PHI | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 34.0 | 14.0 | NaN |
| 2 | post | 2010011001 | NE | BAL | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 14.0 | 33.0 | NaN |
| 3 | post | 2010011000 | ARI | GB | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 51.0 | 45.0 | NaN |
| 4 | post | 2010011601 | NO | ARI | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 45.0 | 14.0 | NaN |
A team's performance is measured with more than just their total score. Thus, also totalled and included is their win count, their winrate and their win/loss ratio. This is shown below.
team_wins = dict()
team_draws = dict()
team_losses = dict()
def add_wins_teams(name: str):
try:
team_wins[teams[name]] += 1
except KeyError:
team_wins[teams[name]] = 1
def add_draws_teams(name: str):
try:
team_draws[teams[name]] += 1
except KeyError:
team_draws[teams[name]] = 1
def add_losses_teams(name: str):
try:
team_losses[teams[name]] += 1
except KeyError:
team_losses[teams[name]] = 1
def update_team_wins_1(row):
if row.home_score > row.away_score:
add_wins_teams(row.home_team)
add_losses_teams(row.away_team)
elif row.home_score < row.away_score:
add_wins_teams(row.away_team)
add_losses_teams(row.home_team)
else:
add_draws_teams(row.home_team)
add_draws_teams(row.away_team)
return row
df_ag = df_ag.apply(update_team_wins_1, axis='columns')
df_ag.head(5)
| type | game_id | home_team | away_team | week | season | state_of_game | game_url | home_score | away_score | Unnamed: 0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | post | 2010010901 | CIN | NYJ | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 14.0 | 24.0 | NaN |
| 1 | post | 2010010900 | DAL | PHI | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 34.0 | 14.0 | NaN |
| 2 | post | 2010011001 | NE | BAL | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 14.0 | 33.0 | NaN |
| 3 | post | 2010011000 | ARI | GB | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 51.0 | 45.0 | NaN |
| 4 | post | 2010011601 | NO | ARI | 18 | 2009 | POST | http://www.nfl.com/liveupdate/game-center/2010... | 45.0 | 14.0 | NaN |
Unfortunately the data has some outliers where some teams far fewer games than all other teams. Upon further investigation these matches are played by "pro teams" that only occur a handful of times per year. These teams are removed due to irrelevance.
team_games = pd.Series({key: team_games[key] for key in team_games if key not in ["American Football Conference","National Football Conference","Team Rice","Team Sanders","Team Irvin","Team Carter"]})
team_scores = pd.Series({key: team_scores[key] for key in team_scores if key not in ["American Football Conference","National Football Conference","Team Rice","Team Sanders","Team Irvin","Team Carter"]})
team_wins = pd.Series({key: team_wins[key] for key in team_wins if key not in ["American Football Conference","National Football Conference","Team Rice","Team Sanders","Team Irvin","Team Carter"]})
team_draws = pd.Series({key: team_draws[key] for key in team_draws if key not in ["American Football Conference","National Football Conference","Team Rice","Team Sanders","Team Irvin","Team Carter"]})
team_losses = pd.Series({key: team_losses[key] for key in team_losses if key not in ["American Football Conference","National Football Conference","Team Rice","Team Sanders","Team Irvin","Team Carter"]})
Using the remaining data, we can see the statistics for each team: their mean score, their total score, their win count, their win rate and their win:loss ratio.
sns.set(rc={'figure.figsize':(15,9)})
mean_score = team_scores/team_games
mean_score = mean_score.sort_values(ascending=False)
ax = sns.barplot(x=mean_score.values, y=mean_score.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
y = patches[i].get_y() + patches[i].get_height()-0.05
x = patches[i].get_width() + 0.30
ax.annotate(str(mean_score.values[i]), (x, y), ha='left')
plt.title("Highest Scoring NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Mean Points Scored")
plt.show()
sns.set(rc={'figure.figsize':(15,9)})
team_scores = team_scores.sort_values(ascending=False)
ax = sns.barplot(x=team_scores.values, y=team_scores.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
y = patches[i].get_y() + patches[i].get_height()-0.05
x = patches[i].get_width() + 0.80
ax.annotate(str(team_scores.values[i]), (x, y), ha='left')
plt.title("Highest Scoring NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Total Points Scored")
plt.show()
sns.set(rc={'figure.figsize':(15,9)})
team_wins = team_wins.sort_values(ascending=False)
ax = sns.barplot(x=team_wins.values, y=team_wins.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
y = patches[i].get_y() + patches[i].get_height()-0.05
x = patches[i].get_width() + 0.80
ax.annotate(str(team_wins.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Games Won")
plt.show()
sns.set(rc={'figure.figsize':(15,9)})
team_ratio = team_wins/team_games
team_ratio = team_ratio.sort_values(ascending=False)
ax = sns.barplot(x=team_ratio.values, y=team_ratio.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
y = patches[i].get_y() + patches[i].get_height()-0.05
x = patches[i].get_width() + 0.005
ax.annotate(str(team_ratio.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Winrate")
plt.show()
sns.set(rc={'figure.figsize':(15,9)})
team_wl = team_wins/team_losses
team_wl = team_wl.sort_values(ascending=False)
ax = sns.barplot(x=team_wl.values, y=team_wl.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
y = patches[i].get_y() + patches[i].get_height()-0.05
x = patches[i].get_width() + 0.01
ax.annotate(str(team_wl.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Win/Loss Ratio")
plt.show()
It is seen that the same few teams are on the top and bottom of each list, namely the New England Patriots and Green Bay Packers on top, and the Cleveland Browns on the bottom just to name a few.
Strategies are based on the plays done during the games.
A common question in American football is if touching down or kicking a field goal is best for scoring points. We shall now attempt to answer that question. We shall quantify that here by finding the number of points scored per games using each of these methods.
There are three different types of touchdowns: pass, rush and return. After each touchdown, which gives 6 points, a player can potentially score up to two points more. This needs to be factored in when finding how many points a touchdown is expected to score.
touchdown_ana = (pbp_combined_raw[pbp_combined_raw.touchdown==1])[["game_id","home_team","away_team","posteam","posteam_type","defteam","side_of_field","play_type","yards_gained","field_goal_result","extra_point_result","two_point_conv_result","touchdown","pass_touchdown","rush_touchdown","return_touchdown","extra_point_attempt","two_point_attempt", "pass_length", "pass_location", "kick_distance"]]
def td_type_fn(row):
if row.pass_touchdown == 1:
return "pass"
if row.rush_touchdown == 1:
return "rush"
if row.return_touchdown == 1:
return "return"
touchdown_ana["tdtype"] = touchdown_ana[["pass_touchdown","rush_touchdown","return_touchdown"]].apply(td_type_fn,
axis=1
)
(touchdown_ana.touchdown*6+touchdown_ana.extra_point_attempt+2*touchdown_ana.two_point_attempt).value_counts()
7.0 15298 6.0 1739 8.0 971 dtype: int64
#mean points gained from each touchdown after accounting for extra point attempts
(touchdown_ana.touchdown*6+touchdown_ana.extra_point_attempt+2*touchdown_ana.two_point_attempt).mean()
6.957352287872057
From this we see that a touchdown has an expected value of around 7 points. We now calculate total points scored from touchdowns and field goals per game:
touchdown_ana["td_points"] = (touchdown_ana.touchdown*6+touchdown_ana.extra_point_attempt+2*touchdown_ana.two_point_attempt)
touchdown_ana.groupby("game_id").td_points.sum()
game_id
2009080950 27.0
2009081350 14.0
2009081351 35.0
2009081352 14.0
2009081353 35.0
...
2020011201 50.0
2020011900 56.0
2020011901 50.0
2020012600 71.0
2020020200 42.0
Name: td_points, Length: 3636, dtype: float64 # mean points earned from touchdowns per game
td_per_g = touchdown_ana.groupby("game_id").td_points.sum().mean()
td_per_g
34.457645764576455
# field goal analysis dataframe
fg_ana = (pbp_combined_raw[pbp_combined_raw.play_type=="field_goal"])[["game_id","home_team","away_team","posteam","posteam_type","defteam","side_of_field","play_type","yards_gained","field_goal_result","extra_point_result","two_point_conv_result","touchdown","pass_touchdown","rush_touchdown","return_touchdown","extra_point_attempt","two_point_attempt", "pass_length", "pass_location", "kick_distance"]]
We find the number of field goals for each game. Only the entries with field_goal_result equal to made is a field goal.
fg_ana.groupby("game_id").field_goal_result.value_counts()
game_id field_goal_result
2009080950 made 3
2009081350 made 3
missed 1
2009081351 made 6
blocked 1
..
2020011201 missed 1
2020011900 made 1
2020011901 made 3
2020012600 made 1
2020020200 made 3
Name: field_goal_result, Length: 5371, dtype: int64 # mean points earned from field goals per game
fg_per_g = fg_ana[fg_ana.field_goal_result=="made"].groupby("game_id").field_goal_result.count().mean()*3
fg_per_g
9.776371308016877
plt.figure(figsize=(10, 4))
plt.bar(['Touchdown', 'Field Goal'], [td_per_g, fg_per_g])
plt.xlabel("Point Scoring Method")
plt.ylabel("Mean Points Scored per Game")
plt.title("Mean Points Scored per Game (Overall)")
Text(0.5, 1.0, 'Mean Points Scored per Game (Overall)')
td_d = touchdown_ana.copy()
fg_d = fg_ana.copy()
td_d.index = pd.to_datetime((touchdown_ana.game_id//100).astype(str))
fg_d.index = pd.to_datetime((fg_ana.game_id//100).astype(str))
td_d = td_d.drop("game_id", axis=1).groupby("game_id").td_points.sum()
fg_d = fg_d.drop("game_id", axis=1)
fg_d = fg_d[fg_d.field_goal_result=="made"].groupby("game_id").field_goal_result.count()*3
pd.concat([td_d,fg_d], axis=1).rename(columns={"td_points":"Touchdown", "field_goal_result":"Field Goal"}).reset_index(drop=True).plot()
plt.xlabel("Game No.")
plt.ylabel("Score per Game")
plt.title("Points Scored from Each Point Scoring Method per Game")
Text(0.5, 1.0, 'Points Scored from Each Point Scoring Method per Game')
From here we can see that field goals score much lower than touchdowns per game. Thus, touchdowns are better than field goals at scoring.
To find out the optimal touchdown strategies, we compare each type of touchdown:
fig = plt.figure(figsize=(10,10))
plt.bar(["Pass","Rush","Return"],
[touchdown_ana.groupby("game_id").pass_touchdown.sum().mean(),
touchdown_ana.groupby("game_id").rush_touchdown.sum().mean(),
touchdown_ana.groupby("game_id").return_touchdown.sum().mean()])
plt.ylabel("Mean Number")
plt.xlabel("Type of Touchdown")
plt.title("Mean Number of Touchdowns Per Game")
plt.show()
Clearly, pass touchdowns are the most common form of touchdown and are thus the most viable.
pass_td = touchdown_ana.groupby("pass_length").pass_touchdown.sum()
return_td = touchdown_ana.groupby("pass_length").return_touchdown.sum()
fig = plt.figure(figsize=(10,10))
td_bar_df = pd.DataFrame({"pass":pass_td,"return":return_td},index=["deep","short"])
td_bar_df.T.plot(kind= 'bar')
plt.xlabel("Type of Touchdown")
plt.ylabel("Number of Touchdowns")
plt.title("Pass Length vs Overall Touchdowns by Type")
plt.show()
<Figure size 720x720 with 0 Axes>
It is clear that short passes are multiple times better than deep passes at scoring touchdowns. Pass touchdowns also seem to be more effective than return touchdowns.
It is also asked, what is the best position to score a touchdown? There are three pass locations in the dataset, namely left, right and middle.
touchdown_strat_df = pd.DataFrame({"Pass": touchdown_ana.groupby("pass_location").pass_touchdown.sum(), "Return": touchdown_ana.groupby("pass_location").return_touchdown.sum()})
touchdown_strat_df.T.plot(kind= 'bar')
plt.xlabel("Type of Touchdown")
plt.ylabel("Number of Touchdowns")
plt.title("Pass Location vs Overall Touchdowns by Type")
plt.show()
Overall, the touchdown strategy can be determined by grouping touchdowns based on the type of touchdown, as well the pass length and the pass location for pass and return touchdowns, and the yards gained for rush touchdowns.
We now do the same for the field goal data, and finding the strategy of where best to kick a field goal from.
fig = plt.figure(figsize=(12,5))
sns.violinplot(y=pd.concat([fg_ana["kick_distance"],fg_ana["kick_distance"]], ignore_index=True),
x=pd.concat([pd.Series(len(fg_ana["field_goal_result"])*["total"]),fg_ana["field_goal_result"]], ignore_index=True),
data=fg_ana);
plt.xlabel("Type of Field Goal")
plt.ylabel("Kick Distance")
plt.title("Violinplot of Kick Distance vs Field Goals Scored")
plt.show()
From this we can see that not only is the chance of making a field goal the highest at about 30 yards away from the goalpost. Both missing and blocking is substantially higher at around 50 yards from the goalpost, where the maximum number of field goal attempts are made. Thus, the best distance from the goalpost to try to score a field goal is slightly lower than the overall median, which is around 39 yards, concluding the overall field goal strategy.
Additionally, we will find out if passing, punting or running is the best method to gain yards.
yds_ana = pbp_combined_raw[["game_id","home_team","away_team","posteam","posteam_type","defteam","side_of_field","play_type","yards_gained", "pass_length", "pass_location", "kick_distance"]]
yds_ana = yds_ana[yds_ana.play_type.isin(["pass","run", "punt"])]
def punt_yards(row):
if row.play_type=="punt":
row.yards_gained = row.kick_distance
return row
yds_ana = yds_ana.apply(punt_yards, axis="columns")
yds_ana1 = yds_ana[yds_ana.play_type.isin(["run","pass","punt"])]
fig = plt.figure(figsize=(15,15))
sns.violinplot(y=yds_ana1.yards_gained,
x=yds_ana1.play_type,
data=yds_ana1);
plt.xlabel("Play Type")
plt.ylabel("Yards Gained")
plt.title("Violinplot of Type of Play and Yards Gained")
plt.show()
Thus, the best method of gaining yards is punting. However, it is noted that punts occur much less than passes or runs and as a result, players should not just punt to gain yards.
Shockingly, many passes have a yards_gained of zero. After further investigations, it seems that a pass in the dataset can either be incomplete, resulting in only an air_yards entry, or a complete pass, resulting in an air_yards entry but having zero for the yards_after_catch entry, or positive values for both air_yards and yards_after_catch. We shall now analyse this to find out more about passes.
A pass is counted as when a player throws the ball to another player (the distance is air_yards), and when the said other player runs with the ball until he loses possession (the distance is yards_after_catch).
pass_ana = pbp_combined_raw[["game_id","home_team","away_team","posteam","posteam_type","defteam","side_of_field","play_type","yards_gained", "air_yards", "yards_after_catch"]][pbp_combined_raw.play_type=="pass"]
pass_ana
| game_id | home_team | away_team | posteam | posteam_type | defteam | side_of_field | play_type | yards_gained | air_yards | yards_after_catch | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2010010901 | CIN | NYJ | CIN | home | NYJ | NYJ | pass | 0.0 | 5.0 | NaN |
| 3 | 2010010901 | CIN | NYJ | CIN | home | NYJ | NYJ | pass | 4.0 | 4.0 | 0.0 |
| 5 | 2010010901 | CIN | NYJ | NYJ | away | CIN | NYJ | pass | 11.0 | 1.0 | 10.0 |
| 9 | 2010010901 | CIN | NYJ | NYJ | away | CIN | NYJ | pass | 1.0 | 1.0 | 0.0 |
| 12 | 2010010901 | CIN | NYJ | CIN | home | NYJ | NYJ | pass | 0.0 | 8.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 645292 | 2019122915 | SEA | SF | SEA | home | SF | SF | pass | 0.0 | 12.0 | NaN |
| 645295 | 2019122915 | SEA | SF | SEA | home | SF | SF | pass | 11.0 | 11.0 | 0.0 |
| 645298 | 2019122915 | SEA | SF | SEA | home | SF | SF | pass | 0.0 | 5.0 | NaN |
| 645299 | 2019122915 | SEA | SF | SEA | home | SF | SF | pass | 0.0 | 5.0 | NaN |
| 645300 | 2019122915 | SEA | SF | SEA | home | SF | SF | pass | 4.0 | 4.0 | 0.0 |
267239 rows × 11 columns
# find mean of air yards divided by yards gained
pass_ana1 = pass_ana[pass_ana.yards_gained!=0]
(pass_ana1.air_yards/pass_ana1.yards_gained).mean()
0.5024565627059023
pass_dist = [pass_ana[pass_ana.yards_after_catch.isna()].size, # incomplete passes
pass_ana[(pass_ana.yards_after_catch==0) & (pass_ana.air_yards>0)].size, # zero yards after catch passes
pass_ana[(pass_ana.yards_after_catch>0) & (pass_ana.air_yards>0)].size] # positive value passes
fig = px.pie(values=pass_dist,
names=["Incomplete Passes", "Zero Yards After Catch Passes", "Other Passes"],
color_discrete_sequence=px.colors.sequential.Turbo)
fig.show()
Performance can be measured in several ways, and the ones that will be analysed here are the total score, the score ratio and the score difference.
The weather data includes measurements of temperature, humidity, wind speed and wind chill. We draw a scatterplot and a boxplot for each of these measurements, with total score as the y-axis and the measurement as the x-axis.
weather["total_score"] = weather.home_score + weather.away_score
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["total_score"], x=weather["temperature"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["total_score"], x=weather["humidity"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["total_score"], x=weather["wind_mph"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["total_score"], x=weather["wind_chill"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["temperature"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["humidity"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["wind_mph"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["wind_chill"], data=weather)
plt.show()
We observe that the higher the wind speed, the lower the total score. Unfortunately, the temperature, humidity and wind chill all do not seem to have an effect on teams' total score. We shall repeat this but with the winning team's goals divided by the losing team's goals.
weather["win_ratio"] = np.maximum(weather.home_score,weather.away_score)/np.minimum(weather.home_score,weather.away_score)
weather["win_ratio"]
0 2.047619
1 1.250000
2 1.214286
3 1.105263
4 inf
...
11187 1.700000
11188 1.050000
11189 2.857143
11190 2.428571
11191 2.857143
Name: win_ratio, Length: 11192, dtype: float64 plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["win_ratio"], x=weather["temperature"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["win_ratio"], x=weather["humidity"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["win_ratio"], x=weather["wind_mph"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["win_ratio"], x=weather["wind_chill"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["temperature"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["humidity"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["wind_mph"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["wind_chill"], data=weather)
plt.show()
We observe that the higher the wind speed, the lower the total score. Unfortunately, the temperature, humidity and wind chill all do not seem to have an effect on teams' total score. We shall repeat this but with the winning team's goals divided by the losing team's goals.
weather["win_ratio"] = np.maximum(weather.home_score,weather.away_score)/np.minimum(weather.home_score,weather.away_score)
weather["win_ratio"]
0 2.047619
1 1.250000
2 1.214286
3 1.105263
4 inf
...
11187 1.700000
11188 1.050000
11189 2.857143
11190 2.428571
11191 2.857143
Name: win_ratio, Length: 11192, dtype: float64 plt.figure(figsize=(15,10))
sns.boxplot(y=weather["win_ratio"], x=weather["temperature"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["win_ratio"], x=weather["humidity"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["win_ratio"], x=weather["wind_mph"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["win_ratio"], x=weather["wind_chill"], data=weather)
plt.show()
Unfortunately, the temperature, humidity, wind speed and wind chill all do not seem to have an effect on teams' win ratio. We shall repeat this but with the difference in scores.
weather["score_diff"] = np.maximum(weather.home_score,weather.away_score)-np.minimum(weather.home_score,weather.away_score)
weather["score_diff"]
0 22
1 7
2 3
3 2
4 20
..
11187 14
11188 1
11189 13
11190 20
11191 13
Name: score_diff, Length: 11192, dtype: int64 plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["score_diff"], x=weather["temperature"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["score_diff"], x=weather["humidity"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["score_diff"], x=weather["wind_mph"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.scatterplot(y=weather["score_diff"], x=weather["wind_chill"], data=weather)
plt.show()
weather["score_diff"] = np.maximum(weather.home_score,weather.away_score)-np.minimum(weather.home_score,weather.away_score)
weather["score_diff"]
0 22
1 7
2 3
3 2
4 20
..
11187 14
11188 1
11189 13
11190 20
11191 13
Name: score_diff, Length: 11192, dtype: int64 plt.figure(figsize=(15,10))
sns.boxplot(y=weather["score_diff"], x=weather["temperature"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["score_diff"], x=weather["humidity"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["score_diff"], x=weather["wind_mph"], data=weather)
plt.show()
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["score_diff"], x=weather["wind_chill"], data=weather)
plt.show()
Unfortunately, the temperature, humidity, wind speed and wind chill all do not seem to have an effect on teams' score difference. We can thus conclude that the only weather factor that has a correlation with teams' performance is wind speed on total score.
def doy(y,m,d):
dom = np.array([31,28,31,30,31,30,31,31,30,31,30,31])
total = 0
if ((y%4==0 and y%100!=0) or y%400==0) and m >= 3:
total += 1
total += np.sum(dom[:(m-1)])
total += d
return total
def ds2010(y,m,d):
total = doy(y,m,d)
for i in range(2010, y):
if (y%4==0 and y%100!=0) or y%400==0:
total += 366
else:
total += 365
return total
ag_date = pd.to_datetime((df_ag.game_id//100).astype(str))
df_ag["total_score"] = df_ag.home_score + df_ag.away_score
df_ag.insert(2,"date", ag_date)
games_by_date = df_ag.set_index("date")
fig, ax = plt.subplots(2,1,figsize=(15,10))
score_means = df_ag.groupby("date").total_score.mean()
score_means.plot(ax=ax[0])
score_means_rolling = df_ag.groupby("date").total_score.rolling(3).mean().reset_index(1,drop=True)
score_means_rolling.plot(ax=ax[1])
<AxesSubplot:xlabel='date'>
srs = score_means_rolling[score_means_rolling.index.year == 2009]
srs.reset_index(drop=True)
0 NaN
1 NaN
2 NaN
3 35.000000
4 31.333333
...
300 41.666667
301 41.333333
302 42.333333
303 39.333333
304 NaN
Name: total_score, Length: 305, dtype: float64 fig, ax = plt.subplots(6,2)
for yr in range(2009,2021):
srs = score_means[score_means.index.year == yr]
srs = srs.reset_index(drop=True)
srs.plot(ax=ax[(yr-2009)//2,1-yr%2])
The unique positions are shown below:
rosters.position.unique()
array(['Quarterback', 'Running Back', 'Tight End', 'Wide Receiver',
'Full Back'], dtype=object) We first match the team name abbreviations to the names of the teams using the dictionary from earlier:
rosters.team = rosters.team.map(teams)
rosters.head(5)
| season | season_type | full_player_name | abbr_player_name | team | position | gsis_id | |
|---|---|---|---|---|---|---|---|
| 0 | 2009 | post | Aaron Rodgers | A.Rodgers | Green Bay Packers | Quarterback | 00-0023459 |
| 1 | 2009 | post | Adrian Peterson | A.Peterson | Minnesota Vikings | Running Back | 00-0025394 |
| 2 | 2009 | post | Ahman Green | A.Green | Green Bay Packers | Running Back | 00-0006305 |
| 3 | 2009 | post | Alex Smith | A.Smith | Philadelphia Eagles | Tight End | 00-0023506 |
| 4 | 2009 | post | Andre Caldwell | A.Caldwell | Cincinnati Bengals | Wide Receiver | 00-0026237 |
Then, we see which position in which team has been played the most.
# most played team and position
rosters.groupby(["team","position"]).full_player_name.nunique().idxmax()
('Detroit Lions', 'Wide Receiver') # number of times the max team and position has been played
rosters.groupby(["team","position"]).full_player_name.nunique().max()
71
# all said players
rosters[(rosters.team == "Detroit Lions") & (rosters.position == "Wide Receiver")].full_player_name.unique()
array(['Calvin Johnson', 'Nate Burleson', 'Titus Young', 'Corey Fuller',
'Golden Tate', 'Jeremy Ross', 'Anquan Boldin', 'Marvin Jones',
'T.J. Jones', 'Billy McMullen', 'Bobby Sippio', 'D.J. Boldin',
'Dane Looker', 'Keary Colbert', 'Brian Clark', 'Bryant Johnson',
'Dennis Northcutt', 'Derrick Williams', 'Eric Fowler', 'Tim Toone',
'Demario Ballard', 'Dominique Barnes', 'Marcus Harris',
'Maurice Stovall', 'Nate Hughes', 'Rashied Davis',
'Dominique Curry', 'Jarett Dillard', 'Lance Long', 'Pat Edwards',
'Ryan Broyles', 'Stefan Logan', 'Terrence Toliver',
'Wallace Miles', 'Cody Wilson', 'Kris Durham', 'Matt Willis',
'Mike Thomas', 'Terrence Austin', 'Andrew Peacock',
'Quintin Payton', 'Jarred Haggins', 'Lance Moore',
'Vernon Johnson', 'Andre Roberts', 'Jace Billingsley', 'Jay Lee',
'Quinshad Davis', 'Dontez Ford', 'Jared Abbrederis',
'Kenny Golladay', 'Keshawn Martin', 'Michael Rector',
'Noel Thomas', 'Bradley Marquez', 'Brandon Powell', 'Brian Brown',
'Chris Lacy', 'Andy Jones', 'Danny Amendola', 'Deontez Alexander',
'Jonathan Duhart', 'Jordan Lasley', 'Tom Kennedy',
'Tommylee Lewis', 'Travis Fulgham', 'John Standeford',
'Brian Robiskie', 'Dorin Dickerson', 'Kevin Ogletree',
'Bruce Ellington'], dtype=object) We'll find which player has been in the most number of different teams.
rosters[rosters.gsis_id == rosters.groupby(["gsis_id"]).team.nunique().idxmax()]
| season | season_type | full_player_name | abbr_player_name | team | position | gsis_id | |
|---|---|---|---|---|---|---|---|
| 3171 | 2012 | pre | Terrelle Pryor | T.Pryor | Las Vegas Raiders | Quarterback | 00-0028825 |
| 3839 | 2013 | pre | Terrelle Pryor | T.Pryor | Las Vegas Raiders | Quarterback | 00-0028825 |
| 4512 | 2014 | pre | Terrelle Pryor | T.Pryor | Seattle Seahawks | Quarterback | 00-0028825 |
| 5804 | 2016 | pre | Terrelle Pryor | T.Pryor | Cleveland Browns | Wide Receiver | 00-0028825 |
| 6482 | 2017 | pre | Terrelle Pryor | T.Pryor | Washington Commanders | Wide Receiver | 00-0028825 |
| 7228 | 2018 | pre | Terrelle Pryor | T.Pryor | New York Jets | Wide Receiver | 00-0028825 |
| 7952 | 2019 | pre | Terrelle Pryor | T.Pryor | Jacksonville Jaguars | Wide Receiver | 00-0028825 |
| 10049 | 2012 | reg | Terrelle Pryor | T.Pryor | Las Vegas Raiders | Quarterback | 00-0028825 |
| 10564 | 2013 | reg | Terrelle Pryor | T.Pryor | Las Vegas Raiders | Quarterback | 00-0028825 |
| 11609 | 2015 | reg | Terrelle Pryor | T.Pryor | Cleveland Browns | Wide Receiver | 00-0028825 |
| 12152 | 2016 | reg | Terrelle Pryor | T.Pryor | Cleveland Browns | Wide Receiver | 00-0028825 |
| 12684 | 2017 | reg | Terrelle Pryor | T.Pryor | Washington Commanders | Wide Receiver | 00-0028825 |
| 13229 | 2018 | reg | Terrelle Pryor | T.Pryor | Buffalo Bills | Wide Receiver | 00-0028825 |
We note that the player who played for the most different teams is Terelle Pryor, who has played for seven different teams from 2012 to 2019.
Now, let's find which player has played in the most number of positions.
# find the maximum number of positions a single player has been in
rosters.groupby("gsis_id").position.nunique().max()
2
position_df = rosters.groupby("gsis_id").position.nunique().to_frame()
position_df = position_df[position_df.position == position_df.position.max()].rename(columns={"position":"pos_count"})
It seems that multiple players have played in 2 positions before. We will now find their names below.
max_positions = pd.merge(rosters[["full_player_name","gsis_id","position"]],
position_df,
on='gsis_id',
how='inner')
with np.printoptions(threshold=np.inf):
print(max_positions.full_player_name.unique())
['Danny Woodhead' 'Dexter McCluster' 'Jason Snelling' 'Michael Robinson' 'James Casey' 'Lousaka Polite' 'Jacob Hester' 'Joe Webb' 'Mike Tolbert' 'Robert Hughes' 'Jamize Olawale' 'Ty Montgomery' 'Zach Zenner' 'C.J. Ham' 'Logan Thomas' 'Zach Line' 'Brock Bolen' 'Dorin Dickerson' 'Earnest Graham' 'Ernest Wilford' 'Greg Jones' 'Jamie McCoy' 'Jason Pociask' 'John Conner' 'Lex Hilliard' 'Manase Tonga' 'Montell Owens' 'Peyton Hillis' 'Tim Castille' 'Frank Summers' 'Mike Sellers' 'Niles Paul' 'Stanley Havili' 'Emil Igwenagu' 'Terrelle Pryor' 'Gerell Robinson' 'Ifeanyi Momah' 'Jason Schepler' 'Jeremy Stewart' 'Patrick DiMarco' 'Anthony Sherman' "De'Anthony Thomas" 'Marcus Lucas' 'Daniel Brown' 'Darren Waller' 'Neal Sterling' 'Byron Marshall' 'Moritz Boehringer' 'Shane Smith' 'Vince Mayle' 'Bug Howard']
max_positions.groupby("full_player_name").position.unique().apply(set).value_counts()
{Running Back, Full Back} 24
{Wide Receiver, Tight End} 12
{Wide Receiver, Running Back} 5
{Tight End, Full Back} 5
{Tight End, Running Back} 2
{Wide Receiver, Quarterback} 2
{Tight End, Quarterback} 1
Name: position, dtype: int64 It seems that most players who double their roles do so as running backs and full backs, followed by wide receivers and tight ends.
We now find the trends in the details of the players themselves.
sns.violinplot(players.year_end-players.year_start)
plt.xlabel("Years Played")
plt.title("Violinplot of Number of Years Played by Players")
Text(0.5, 1.0, 'Violinplot of Number of Years Played by Players')
We observe that the median number of years played is 2, and the maximum is 21.
sns.boxplot(x=players.year_start, y=players.weight)
<AxesSubplot:xlabel='year_start', ylabel='weight'>
sns.boxplot(x=players.year_end-players.year_start, y=players.weight)
plt.xlabel("Years Played")
Text(0.5, 0, 'Years Played')
There does not seem to be a correlation between the number of years a player has played or when they started playing and their weight.
sns.set(rc={'figure.figsize':(15,9)})
mean_score = team_scores/team_games
mean_score = mean_score.sort_values(ascending=False)
ax = sns.barplot(x=mean_score.values, y=mean_score.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
y = patches[i].get_y() + patches[i].get_height()-0.05
x = patches[i].get_width() + 0.30
ax.annotate(str(mean_score.values[i]), (x, y), ha='left')
plt.title("Highest Scoring NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Mean Points Scored")
plt.show()
sns.set(rc={'figure.figsize':(15,9)})
team_scores = team_scores.sort_values(ascending=False)
ax = sns.barplot(x=team_scores.values, y=team_scores.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
y = patches[i].get_y() + patches[i].get_height()-0.05
x = patches[i].get_width() + 0.80
ax.annotate(str(team_scores.values[i]), (x, y), ha='left')
plt.title("Highest Scoring NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Total Points Scored")
plt.show()
sns.set(rc={'figure.figsize':(15,9)})
team_wins = team_wins.sort_values(ascending=False)
ax = sns.barplot(x=team_wins.values, y=team_wins.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
y = patches[i].get_y() + patches[i].get_height()-0.05
x = patches[i].get_width() + 0.80
ax.annotate(str(team_wins.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Games Won")
plt.show()
sns.set(rc={'figure.figsize':(15,9)})
team_ratio = team_wins/team_games
team_ratio = team_ratio.sort_values(ascending=False)
ax = sns.barplot(x=team_ratio.values, y=team_ratio.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
y = patches[i].get_y() + patches[i].get_height()-0.05
x = patches[i].get_width() + 0.005
ax.annotate(str(team_ratio.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Winrate")
plt.show()
sns.set(rc={'figure.figsize':(15,9)})
team_wl = team_wins/team_losses
team_wl = team_wl.sort_values(ascending=False)
ax = sns.barplot(x=team_wl.values, y=team_wl.index)
sns.despine(left=True, bottom=True)
patches = ax.patches
for i in range(len(patches)):
y = patches[i].get_y() + patches[i].get_height()-0.05
x = patches[i].get_width() + 0.01
ax.annotate(str(team_wl.values[i]), (x, y), ha='left')
plt.title("Highest Winning NFL Teams from 2009 to 2019")
plt.ylabel('Team')
plt.xlabel("Win/Loss Ratio")
plt.show()
The New England Patriots and Green Bay Packers have the first and third highest results respectively in all the graphs. Baltimore Ravens has the second highest result in 3 graphs and New Orleans Saints has the second highest result in 2 graphs. On the other hand, the Jacksonville Jaguars and Cleveland Browns are consistently the bottom two teams. Thus, we can conclude that the New England Patriots is the best team overall, having both won the most games and scored the most per game.
The strategy for scoring field goals is shown below:
fig = plt.figure(figsize=(12,5))
sns.violinplot(y=pd.concat([fg_ana["kick_distance"],fg_ana["kick_distance"]], ignore_index=True),
x=pd.concat([pd.Series(len(fg_ana["field_goal_result"])*["total"]),fg_ana["field_goal_result"]], ignore_index=True),
data=fg_ana);
plt.xlabel("Type of Field Goal")
plt.ylabel("Kick Distance")
plt.title("Violinplot of Kick Distance vs Total Field Goals Scored")
plt.show()
From this we can see that not only is the chance of making a field goal the highest at about 30 yards away from the goalpost. Both missing and blocking is substantially higher at around 50 yards from the goalpost, where the maximum number of field goal attempts are made. Thus, the best distance from the goalpost to try to score a field goal is slightly lower than the overall median, which is around 39 yards, concluding the field goal strategy.
fig = plt.figure(
FigureClass=Waffle,
rows=10,
columns=40,
values=[td_per_g, fg_per_g],
figsize=(12, 8), # figsize is a parameter of matplotlib.pyplot.figure
labels = ["Touchdown points per game", "Field Goal points per game"],
legend={
'loc': 'lower left',
'bbox_to_anchor': (0, -0.2),
'framealpha': 0,
'fontsize': 12
}
)
plt.title("Average Touchdown Points vs Field Goal Points scored per game")
plt.show()
The mean points scored per game overall is higher for touchdowns than for field goals.
plt.figure(figsize=(10, 4))
plt.bar(['Touchdown', 'Field Goal'], [td_per_g, fg_per_g])
plt.xlabel("Point Scoring Method")
plt.ylabel("Mean Points Scored per Game")
plt.title("Mean Points Scored per Game (Overall)")
Text(0.5, 1.0, 'Mean Points Scored per Game (Overall)')
Below, we can see that field goals score much lower than touchdowns per game. Combined with the above bar chart, touchdowns are better than field goals at scoring.
pd.concat([td_d,fg_d], axis=1).rename(columns={"td_points":"Touchdown", "field_goal_result":"Field Goal"}).reset_index(drop=True).plot()
plt.xlabel("Game No.")
plt.ylabel("Score per Game")
plt.title("Points Scored from Each Point Scoring Method per Game")
Text(0.5, 1.0, 'Points Scored from Each Point Scoring Method per Game')
To find out the optimal touchdown strategies, we compare each type of touchdown:
fig = plt.figure(figsize=(10,10))
plt.bar(["Pass","Rush","Return"],
[touchdown_ana.groupby("game_id").pass_touchdown.sum().mean(),
touchdown_ana.groupby("game_id").rush_touchdown.sum().mean(),
touchdown_ana.groupby("game_id").return_touchdown.sum().mean()])
plt.ylabel("Mean Number")
plt.xlabel("Type of Touchdown")
plt.title("Mean Number of Touchdowns Per Game")
plt.show()
Pass touchdowns are the most common form of touchdown.
fig = plt.figure(figsize=(10,10))
td_bar_df = pd.DataFrame({"pass":pass_td,"return":return_td},index=["deep","short"])
td_bar_df.T.plot(kind= 'bar')
plt.xlabel("Type of Touchdown")
plt.ylabel("Number of Touchdowns")
plt.title("Pass Length vs Overall Touchdowns by Type")
plt.show()
<Figure size 720x720 with 0 Axes>
It is clear that short passes are multiple times better than deep passes at scoring touchdowns. Pass touchdowns also seem to be more effective than return touchdowns.
The right side of the field is the best place to score a touchdown, and the middle of the field is the worst.
touchdown_strat_df = pd.DataFrame({"Pass": touchdown_ana.groupby("pass_location").pass_touchdown.sum(), "Return": touchdown_ana.groupby("pass_location").return_touchdown.sum()})
touchdown_strat_df.T.plot(kind= 'bar')
plt.xlabel("Type of Touchdown")
plt.ylabel("Number of Touchdowns")
plt.title("Pass Location vs Overall Touchdowns by Type")
plt.show()
Overall, the touchdown strategy can be determined by grouping touchdowns based on the type of touchdown, as well the pass length and the pass location for pass and return touchdowns, and the yards gained for rush touchdowns.
Teh strategy of where best to kick a field goal from is shown below.
fig = plt.figure(figsize=(12,5))
sns.violinplot(y=pd.concat([fg_ana["kick_distance"],fg_ana["kick_distance"]], ignore_index=True),
x=pd.concat([pd.Series(len(fg_ana["field_goal_result"])*["total"]),fg_ana["field_goal_result"]], ignore_index=True),
data=fg_ana);
plt.xlabel("Type of Field Goal")
plt.ylabel("Kick Distance")
plt.title("Violinplot of Kick Distance vs Field Goals Scored")
plt.show()
From this we can see that not only is the chance of making a field goal the highest at about 30 yards away from the goalpost. Both missing and blocking is substantially higher at around 50 yards from the goalpost, where the maximum number of field goal attempts are made. Thus, the best distance from the goalpost to try to score a field goal is slightly lower than the overall median, which is around 39 yards, concluding the overall field goal strategy.
Out of passing, punting and running, punting is the best method to gain yards as shown below.
yds_ana1 = yds_ana[yds_ana.play_type.isin(["run","pass","punt"])]
fig = plt.figure(figsize=(15,15))
sns.violinplot(y=yds_ana1.yards_gained,
x=yds_ana1.play_type,
data=yds_ana1);
plt.xlabel("Play Type")
plt.ylabel("Yards Gained")
plt.title("Violinplot of Type of Play and Yards Gained")
plt.show()
fig = px.pie(values=pass_dist,
names=["Incomplete Passes", "Zero Yards After Catch Passes", "Other Passes"],
color_discrete_sequence=px.colors.sequential.Turbo)
fig.show()
The only environmental factor that affects any team performance indicator is wind speed, and it only affects the total score. As the wind speed increases from 0 to 21 miles per hour, the total score consistently decreases.
plt.figure(figsize=(15,10))
sns.boxplot(y=weather["total_score"], x=weather["wind_mph"], data=weather)
plt.xlabel("Wind Speed in Miles Per Hour")
plt.ylabel("Total Score per Game")
plt.title("Wind Speed vs Total Score per Game")
plt.show()
rosters[rosters.gsis_id == rosters.groupby(["gsis_id"]).team.nunique().idxmax()]
| season | season_type | full_player_name | abbr_player_name | team | position | gsis_id | |
|---|---|---|---|---|---|---|---|
| 3171 | 2012 | pre | Terrelle Pryor | T.Pryor | Las Vegas Raiders | Quarterback | 00-0028825 |
| 3839 | 2013 | pre | Terrelle Pryor | T.Pryor | Las Vegas Raiders | Quarterback | 00-0028825 |
| 4512 | 2014 | pre | Terrelle Pryor | T.Pryor | Seattle Seahawks | Quarterback | 00-0028825 |
| 5804 | 2016 | pre | Terrelle Pryor | T.Pryor | Cleveland Browns | Wide Receiver | 00-0028825 |
| 6482 | 2017 | pre | Terrelle Pryor | T.Pryor | Washington Commanders | Wide Receiver | 00-0028825 |
| 7228 | 2018 | pre | Terrelle Pryor | T.Pryor | New York Jets | Wide Receiver | 00-0028825 |
| 7952 | 2019 | pre | Terrelle Pryor | T.Pryor | Jacksonville Jaguars | Wide Receiver | 00-0028825 |
| 10049 | 2012 | reg | Terrelle Pryor | T.Pryor | Las Vegas Raiders | Quarterback | 00-0028825 |
| 10564 | 2013 | reg | Terrelle Pryor | T.Pryor | Las Vegas Raiders | Quarterback | 00-0028825 |
| 11609 | 2015 | reg | Terrelle Pryor | T.Pryor | Cleveland Browns | Wide Receiver | 00-0028825 |
| 12152 | 2016 | reg | Terrelle Pryor | T.Pryor | Cleveland Browns | Wide Receiver | 00-0028825 |
| 12684 | 2017 | reg | Terrelle Pryor | T.Pryor | Washington Commanders | Wide Receiver | 00-0028825 |
| 13229 | 2018 | reg | Terrelle Pryor | T.Pryor | Buffalo Bills | Wide Receiver | 00-0028825 |
The maximum number of teams someone has been in is 7, and that person is Terelle Pryor.
A distribution of how many teams each player has played in is shown below. Surprisingly, almost 40% of players have been in more than one team.
dist_num_teams = rosters.groupby(["gsis_id"]).team.nunique().reset_index(drop=True).value_counts()
fig = px.pie(values=dist_num_teams,
names=dist_num_teams.index,
color_discrete_sequence=px.colors.sequential.Aggrnyl)
fig.show()
The maximum number of positions someone has been in is 2:
rosters.groupby("gsis_id").position.nunique().max()
2
A total of 51 players have played in 2 positions, as shown below.
max_positions.full_player_name.unique().size
51
position_df = rosters.groupby("gsis_id").position.nunique().to_frame()
position_df = position_df[position_df.position == position_df.position.max()].rename(columns={"position":"pos_count"})
max_positions = pd.merge(rosters[["full_player_name","gsis_id","position"]],
position_df,
on='gsis_id',
how='inner')
pos_pairs = max_positions.groupby("full_player_name").position.unique().apply(set).value_counts()
with np.printoptions(threshold=np.inf):
print(max_positions.full_player_name.unique())
['Danny Woodhead' 'Dexter McCluster' 'Jason Snelling' 'Michael Robinson' 'James Casey' 'Lousaka Polite' 'Jacob Hester' 'Joe Webb' 'Mike Tolbert' 'Robert Hughes' 'Jamize Olawale' 'Ty Montgomery' 'Zach Zenner' 'C.J. Ham' 'Logan Thomas' 'Zach Line' 'Brock Bolen' 'Dorin Dickerson' 'Earnest Graham' 'Ernest Wilford' 'Greg Jones' 'Jamie McCoy' 'Jason Pociask' 'John Conner' 'Lex Hilliard' 'Manase Tonga' 'Montell Owens' 'Peyton Hillis' 'Tim Castille' 'Frank Summers' 'Mike Sellers' 'Niles Paul' 'Stanley Havili' 'Emil Igwenagu' 'Terrelle Pryor' 'Gerell Robinson' 'Ifeanyi Momah' 'Jason Schepler' 'Jeremy Stewart' 'Patrick DiMarco' 'Anthony Sherman' "De'Anthony Thomas" 'Marcus Lucas' 'Daniel Brown' 'Darren Waller' 'Neal Sterling' 'Byron Marshall' 'Moritz Boehringer' 'Shane Smith' 'Vince Mayle' 'Bug Howard']
max_positions.full_player_name.unique().size
51
pos_pairs.index = pos_pairs.index.astype("str")
fig = px.pie(values=pos_pairs,
names=pos_pairs.index,
color_discrete_sequence=px.colors.sequential.Jet)
fig.show()
In conclusion, most players who double their roles do so as running backs and full backs with 47.1% of them doing so, followed by wide receivers and tight ends at 23.5% and 9.8% respectively.
Most players also play for two years or less, as shown below.
sns.violinplot(players.year_end-players.year_start)
plt.xlabel("Years Played")
plt.title("Violinplot of Years Played per Player")
Text(0.5, 1.0, 'Violinplot of Years Played per Player')
The median number of years played is 2, and the maximum is 21. A pieplot is shown below for the same statistic:
fig = px.pie(values=(players.year_end-players.year_start).sort_index().value_counts(),
names=(players.year_end-players.year_start).value_counts().sort_index().index)
fig.show()
sns.boxplot(x=players.year_start, y=players.weight)
plt.xlabel("Year Started Playing")
plt.ylabel("Weight in Pounds")
plt.title("Boxplot of Players' Weight over Time")
Text(0.5, 1.0, "Boxplot of Players' Weight over Time")
sns.boxplot(x=players.year_end-players.year_start, y=players.weight)
plt.xlabel("Years Played")
plt.ylabel("Weight in Pounds")
plt.title("Boxplot of Players' Weight vs their Playing Experience")
Text(0.5, 1.0, "Boxplot of Players' Weight vs their Playing Experience")
The weight of the players stays about constant regardless of when they started playing or how long they played for.
In conclusion, we have explored and discussed the relationships present between various variables involved in American football as well as possible reasons behind the trends for each. While some strong trends have been found, there are also other factors present, as well as variables which have been found to be purely random.
Many factors may not be correlations, due to factors like player skill which is not quantifiable. Weather data, which was previously thought to be somewhat related to team performance, has been shown to barely affect it at all. Other factors such as the composition of the teams changing over time was also not fully taken into account when finding the trend for team performance specifically. It is also unfortunate that the timeframes of the data did not fit exactly with each other, rendering it unable to interconnect the datasets substantially to find more potentially underlying trends within the data.
About 230 of the columns from one of the datasets had to be dropped out of almost 260. This shows that the data is far from perfect, however, some of these columns may have had some data with correlation that was removed for a variety of reasons, for example, the fumble column of the play by play dataset, although it may contribute to the score, such actions are uncontrollable by the player in an actual game, so including it, although it may lead to a more substantial conclusion, it would just add a larger luck factor to the final trend.
All referenced documents have been included with the submission of proposal
https://wordpress.com/support/markdown-quick-reference/ (you may refer to this link on markup for Jupyter when formatting your proposal)